Exercise 1: Getting started

Why am I here?

Because Data Science is cool, and Open Data Science is even more so!
I’m quite familiar with R (5 years of experience), but I’d also like to learn how to use GitHub and RMarkdown.

What did I do?

  1. Created a GitHub account and got myself the course project template.
  2. Tried to work with Git on my business laptop.
  3. Bought and installed a new laptop, as my old one was broken, and I couldn’t connect to GitHub on my business laptop.
  4. Installed R and RStudio.
  5. Installed GitHub Desktop and cloned the course repository.
  6. Wrote this description, set up my course diary and changed its theme to “paper”.
  7. Created and knitted README.Rmd.
  8. Uploaded the changes to GitHub.

What did I learn?

How to get started with Git, GitHub and RMarkdown.

Obligatory cat pic

Working process was approved by the cat.



Exercise 2: Regression and model validation

Description of the data

The dataset consists of data on the relationship between students’ learning approaches and their achievements in an introductory statistics course in Finland. More information about the original dataset is available at http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS2-meta.txt.

After manipulation, the dataset only has the following variables:

* gender     Gender: M (Male), F (Female)
* age        Age (in years) derived from the date of birth
* attitude   Global attitude toward statistics
* deep       Average of points from questions related to deep learning approach
* stra       Average of points from questions related to strategic learning approach
* surf       Average of points from questions related to surface learning approach
* points     Exam points

Students with zero exam points were removed from the dataset.

Data manipulation is described in detail here.

Reading the data into RStudio

The data can be read into RStudio, e.g., with the read.table command as follows:

students2014 <- read.table("./data/learning2014.txt", sep="\t", dec=".", header=T)

Now the data is stored in the R object students2014.

Explorative analysis

How does the manipulated data look? A table of summary statistics is often a good starting point for an explorative analysis.

library(psych)
describe(students2014)
##          vars   n  mean   sd median trimmed  mad   min   max range  skew
## gender*     1 166  1.34 0.47   1.00    1.30 0.00  1.00  2.00  1.00  0.68
## age         2 166 25.51 7.77  22.00   23.99 2.97 17.00 55.00 38.00  1.89
## attitude    3 166 31.43 7.30  32.00   31.52 7.41 14.00 50.00 36.00 -0.08
## deep        4 166  3.68 0.55   3.67    3.70 0.62  1.58  4.92  3.33 -0.50
## stra        5 166  2.79 0.53   2.83    2.78 0.62  1.58  4.33  2.75  0.14
## surf        6 166  3.12 0.77   3.19    3.14 0.83  1.25  5.00  3.75 -0.11
## points      7 166 22.72 5.89  23.00   23.08 5.93  7.00 33.00 26.00 -0.40
##          kurtosis   se
## gender*     -1.54 0.04
## age          3.24 0.60
## attitude    -0.48 0.57
## deep         0.66 0.04
## stra        -0.27 0.04
## surf        -0.45 0.06
## points      -0.26 0.46

A boxplot is a graphical tool for examining the distributions of individual variables. In this case, there is quite a lot of variation in the ranges of individual variables, and the picture is not as informative as one could wish.

boxplot(students2014)

We can draw a pairplot to check if there are associations between the variables.

library(GGally)
library(ggplot2)
p <- ggpairs(students2014, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
p

In order to check the gender distribution of students, we can use the table command. It seems that women are more interested in statistics than men! (Or perhaps the course was simply mandatory for more women…)

table(students2014$gender)
## 
##   F   M 
## 110  56

The pairplot suggests that the variable points might be slightly bimodal, but in a histogram there seems to be no severe deviation from normality.

hist(students2014$points, main="Distribution of exam points", xlab="points")

All in all, the explorative analysis tells us that

  • Gender is a binary variable, and the other variables are continuous.
  • Age is skewed to the right, as the typical age of university students is twenty-something, but there are also a couple of older students.
  • There might be an association between a student’s attitude toward statistics and his/her exam points.
  • Exam points is reasonably normally distributed, and we can use a linear regression model to model the association between a student’s attitude and his/her exam points.
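
The visual impression of approximate normality can also be checked numerically, e.g. with a Shapiro-Wilk test (a sketch, assuming students2014 has been read in as above; a large p-value gives no evidence against normality):

```r
# Shapiro-Wilk test of normality for exam points
shapiro.test(students2014$points)
```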

Regression analyses

Let’s fit a regression model and check whether the suggested association between a student’s attitude and his/her exam points is statistically sound. Clearly, exam points is the response variable and attitude is an explanatory variable. Adding multiple explanatory variables to the model might reveal more associations between the variables, so let’s also include gender and surf as explanatory variables.

fit1 <- lm(points ~ gender + attitude + surf, data = students2014)
summary(fit1)
## 
## Call:
## lm(formula = points ~ gender + attitude + surf, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.7179  -3.3285   0.5343   3.7412  10.9007 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.97982    2.40303   3.737 0.000258 ***
## genderM     -0.22362    0.92477  -0.242 0.809231    
## attitude     0.35100    0.05956   5.893 2.13e-08 ***
## surf         0.89107    0.54409   1.638 0.103419    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.304 on 162 degrees of freedom
## Multiple R-squared:  0.2051, Adjusted R-squared:  0.1904 
## F-statistic: 13.93 on 3 and 162 DF,  p-value: 3.982e-08

Attitude is the only statistically significant (p < 0.05) explanatory variable for exam points among the three that were tested. The other two are not statistically significant (p > 0.05), i.e. we don’t have sufficient evidence to reject the null hypothesis that the regression coefficients (column Estimate) of gender and surf are zero. We should note that the borderline p-value of 0.05 is only a coarse rule of thumb, and cases near the borderline should be examined more carefully.
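
The p-values quoted above can also be extracted programmatically from the summary object, which is handy when screening several candidate models (a sketch, assuming fit1 from above):

```r
# Column "Pr(>|t|)" of the coefficient matrix holds the p-values
pvals <- summary(fit1)$coefficients[, "Pr(>|t|)"]
round(pvals, 4)
```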

Let’s refit the model without gender and surf.

fit2 <- lm(points ~ attitude, data = students2014)
summary(fit2)
## 
## Call:
## lm(formula = points ~ attitude, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The explanatory variable attitude is statistically significant (p < 0.001), so we can reject the null hypothesis that attitude has no effect on exam points. The intercept (constant) term of the model, which is the expected mean value of exam points when attitude is zero, is also statistically significant. There is only one explanatory variable left in this model, so we don’t need to worry about potential collinearity between predictors. The coefficient of attitude (0.35) means that when attitude increases by one unit, the student’s exam score is expected to increase by 0.35 units on average. The multiple R-squared value indicates that the model explains about 19 % of the variation in the response variable. The adjusted R-squared is essentially the same quantity, but it accounts for R-squared automatically increasing when extra explanatory variables are added to the model.
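
The interpretation of the coefficient can be illustrated with the predict function (a sketch, assuming fit2 from above; the attitude values 30 and 31 are arbitrary):

```r
# Predicted exam points for two hypothetical students whose attitude
# differs by one unit; the difference between the two predictions
# equals the attitude coefficient (about 0.35)
predict(fit2, newdata = data.frame(attitude = c(30, 31)))
```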

We can now visualize the regression model on the observed data as a red line:

plot(students2014$attitude, students2014$points, xlim = c(0, max(students2014$attitude)), xlab="Attitude toward statistics", ylab="Exam points")
abline(fit2, col="red")

Note to self: Using the p-values of individual model coefficients is not the recommended way of doing model selection, and it might be a better idea to calculate e.g. an AIC value for each model (a smaller AIC indicates a better fit). In this case, however, the conclusion is the same.

paste0("AIC for model 1: ", AIC(fit1))
## [1] "AIC for model 1: 1030.9779677774"
paste0("AIC for model 2: ", AIC(fit2))
## [1] "AIC for model 2: 1029.9875233202"

Model diagnostics

Before declaring that a student’s attitude toward statistics is a good predictor for his/her exam performance, we have to make sure that the critical assumptions of linear regression model are met. The assumptions of the model are:

  1. The errors are normally distributed.
  2. The errors are not correlated.
  3. The errors have a constant variance (homoscedasticity).
  4. The size of a given error does not depend on the explanatory variables.

Let’s produce a couple of diagnostic plots to check these assumptions.

par(mfrow=c(2,2))
plot(fit2, which=c(1,2,5))

The first plot (residuals vs. fitted values) shows no detectable pattern in the residuals, indicating that assumptions 3 and 4 hold. Assumption 1 also seems valid, as the residuals align quite neatly along or near the straight line in the Q-Q plot. No influential outliers show up in the last plot. Assumption 2 cannot be assessed from these diagnostic plots, but it is mostly a concern when temporal or spatial autocorrelation is possible (e.g. time series or spatial data). Thus, the assumptions of the linear regression model seem to be valid for our model.
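
The homoscedasticity assumption can also be checked numerically, e.g. with a Breusch-Pagan test (a sketch, assuming fit2 from above and that the lmtest package is installed; a small p-value would suggest heteroscedasticity):

```r
library(lmtest)
# Breusch-Pagan test of constant error variance for the fitted model
bptest(fit2)
```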

Summary

There is a statistical association between a student’s attitude toward statistics and his/her exam performance: the more positive the attitude, the better the expected performance. This association can be modeled with a linear regression model.


Exercise 3: Logistic regression

Description of the data

The joined data set used in this analysis exercise combines two student alcohol consumption data sets. The following adjustments have been made in the data wrangling exercise:

  • The variables not used for joining the two datasets have been combined by averaging (including the grade variables)
  • ‘alc_use’ is the average of ‘Dalc’ and ‘Walc’
  • ‘high_use’ is TRUE if ‘alc_use’ is higher than 2 and FALSE otherwise

More info: Original data source and R-code for data manipulation
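
The two derived variables can be sketched roughly as follows (a sketch only; the column names Dalc and Walc come from the original data, and the actual wrangling script is linked above):

```r
library(dplyr)
alc <- mutate(alc,
              alc_use  = (Dalc + Walc) / 2,  # average of weekday and weekend use
              high_use = alc_use > 2)        # TRUE for consumption above 2
```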

First, let’s read the data into R and list the variable names:

alc <- read.table("./data/alc.csv", sep=";", dec=".", header=T)
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

Hypotheses

The purpose of the analysis was to study the relationships between high/low alcohol consumption and some of the other variables in the data. I chose the following variables for the analysis and set these hypotheses:

Response variable:

  • high_use - Boolean response variable describing the level of alcohol consumption

Explanatory variables:

  • sex - males’ alcohol consumption is on average higher than females’
  • Pstatus - parents’ cohabitation reduces risk for high alcohol consumption
  • absences - positive statistical association between absences and high alcohol consumption
  • G3 (final grade) - better grade, lower alcohol consumption

Next, I limited the data used for the analysis to include only the five variables mentioned above.

library(dplyr)
dd <- select(alc, one_of(c("sex", "Pstatus", "absences", "G3", "high_use")))
str(dd)
## 'data.frame':    382 obs. of  5 variables:
##  $ sex     : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ absences: int  5 3 8 1 2 8 0 4 0 0 ...
##  $ G3      : int  8 8 11 14 12 14 12 10 18 14 ...
##  $ high_use: logi  FALSE FALSE TRUE FALSE FALSE FALSE ...

Explorative analysis

Now we have some hypotheses produced with the Stetson-Harrison method. Will the data shoot them down straight away? Let’s produce some summaries and explorative plots of the variables and their associations.

First, a basic summary table:

# Basic summary table
library(psych)
describe(dd, skew=F)
##           vars   n  mean   sd min  max range   se
## sex*         1 382  1.48 0.50   1    2     1 0.03
## Pstatus*     2 382  1.90 0.30   1    2     1 0.02
## absences     3 382  4.50 5.46   0   45    45 0.28
## G3           4 382 11.46 3.31   0   18    18 0.17
## high_use*    5 382   NaN   NA Inf -Inf  -Inf   NA

Sex and parents’ cohabitation status are binary factors, and their associations with high alcohol use are easy to present as crosstabulations. Absences and final grade are integer variables with a wide range of values, and thus less suitable for crosstabulation.

tab1 <- table(high_use=dd$high_use, sex=dd$sex)
tab2 <- table(high_use=dd$high_use, parents_status=dd$Pstatus)
tabp1 <- prop.table(tab1, 2)
tabp2 <- prop.table(tab2, 2)
list(addmargins(tab1), tabp1,
     addmargins(tab2), tabp2)
## [[1]]
##         sex
## high_use   F   M Sum
##    FALSE 156 112 268
##    TRUE   42  72 114
##    Sum   198 184 382
## 
## [[2]]
##         sex
## high_use         F         M
##    FALSE 0.7878788 0.6086957
##    TRUE  0.2121212 0.3913043
## 
## [[3]]
##         parents_status
## high_use   A   T Sum
##    FALSE  26 242 268
##    TRUE   12 102 114
##    Sum    38 344 382
## 
## [[4]]
##         parents_status
## high_use         A         T
##    FALSE 0.6842105 0.7034884
##    TRUE  0.3157895 0.2965116

We might visualize these associations with stacked proportional bar plots:

layout(matrix(c(1,1,2,2,3), nrow=1, ncol=5, byrow=T))
par(mai=c(0.6,0.3,0.6,0.3))
barplot(tabp1, main="Sex vs high use", col=c("#F8766D","#00BFC4"))
barplot(tabp2, main="Pstatus vs high use", col=c("#F8766D","#00BFC4"))
par(mai=c(0.7,0.1,0.7,0.1))
plot.new()
legend("topleft", legend=c("FALSE","TRUE"), fill=c("#F8766D","#00BFC4"), title="High use")

The associations of absences and final grade with alcohol use can be visualized more easily with boxplots:

library(ggplot2)
library(gridExtra)
g1 <- ggplot(dd, aes(x = high_use, y = G3, fill=high_use)) + ggtitle("Grade vs high use")
g2 <- ggplot(alc, aes(x = high_use, y = absences, fill=high_use)) + ggtitle("Absences vs high use")
bp3 <- g1 + geom_boxplot() + ylab("grade") + theme(legend.position="none", plot.title = element_text(hjust = 0.5))
bp4 <- g2 + geom_boxplot() + ylab("absences") + theme(legend.position="none", plot.title = element_text(hjust = 0.5))
grid.arrange(bp3, bp4, ncol=2, nrow=1)

Based on the explorative analysis, males seem to be more prone to drink heavily than females. Heavy drinkers get, on average, lower grades than those who don’t drink much, and they are also slightly more inclined to have a high number of absences. However, there seems to be no association between parents’ cohabitation status and a student’s alcohol use. It seems that my hypotheses about the associations of sex, absences and final grade with alcohol consumption might get some support from the explorative analysis.

Logistic regression model

Let’s fit a logistic regression model to see if these findings hold.

# First model
fit1 <- glm(high_use ~ sex + Pstatus + absences + G3, family=binomial, data=dd)
summary(fit1)
## 
## Call:
## glm(formula = high_use ~ sex + Pstatus + absences + G3, family = binomial, 
##     data = dd)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2443  -0.8411  -0.6244   1.0629   2.1686  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.98141    0.61643  -1.592   0.1114    
## sexM         0.98330    0.24149   4.072 4.67e-05 ***
## PstatusT     0.01709    0.40418   0.042   0.9663    
## absences     0.09273    0.02316   4.003 6.24e-05 ***
## G3          -0.07570    0.03584  -2.112   0.0347 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 425.51  on 377  degrees of freedom
## AIC: 435.51
## 
## Number of Fisher Scoring iterations: 4

The equation for the model is of the form log(p/(1-p)) = a + b*sex + c*Pstatus + d*absences + e*G3, where p is the estimated probability of high alcohol use and a, b, c, d and e are the model coefficients; log(p/(1-p)) is the logit function of p. For factor variables (sex and Pstatus) one level is chosen as the base level, and the reported coefficient denotes the difference between the marked level and the base level. So the coefficient for the sex level F (chosen as the base level and omitted from the table) is absorbed into the intercept -0.98, and the effective coefficient for the sex level M is -0.98141 + 0.98330 = 0.00189 (cf. explanation here).

The coefficients are easier to interpret if transformed to odds ratios (ORs) by exponentiation: OR = exp(b), where b is the original coefficient (OC). The odds ratio of a coefficient denotes the multiplicative change in the odds p/(1-p) when the value of the explanatory variable increases by one unit. We can also use exponentiation and the confint function to calculate confidence intervals for the odds ratios at the 0.95 confidence level.

data.frame(OR=round(exp(summary(fit1)$coef[,1]), 3),
           CI=round(exp(confint(fit1)), 3),
           "p-value"=round(summary(fit1)$coef[,4], 3))
##                OR CI.2.5.. CI.97.5.. p.value
## (Intercept) 0.375    0.109     1.228   0.111
## sexM        2.673    1.675     4.325   0.000
## PstatusT    1.017    0.472     2.331   0.966
## absences    1.097    1.051     1.151   0.000
## G3          0.927    0.864     0.994   0.035

Interpretation of OR values:

  • If OR > 1 (i.e. OC > 0), an increase in the explanatory variable increases the response probability p.
  • If 0 < OR < 1 (i.e. OC < 0), an increase in the explanatory variable decreases the response probability p.
  • If OR = 1 (i.e. OC = 0), the explanatory variable has no effect on the response variable.
  • (OR is never negative, and OR for a coefficient can’t be exactly zero.)
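
The relationship between log-odds, odds and probability can be checked with R’s built-in logistic functions (a sketch; the probability 0.3 is an arbitrary example):

```r
# qlogis is the logit function, plogis its inverse
p <- 0.3
all.equal(plogis(qlogis(p)), p)  # the two functions invert each other
p / (1 - p)                      # the odds corresponding to p
```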

We can check with a likelihood ratio test if the factor variables sex and Pstatus as a whole are statistically significant explanators in our fitted model.

anova(fit1, test="Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: high_use
## 
## Terms added sequentially (first to last)
## 
## 
##          Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                       381     465.68              
## sex       1  14.7320       380     450.95 0.0001239 ***
## Pstatus   1   0.1157       379     450.83 0.7337978    
## absences  1  20.8326       378     430.00 5.012e-06 ***
## G3        1   4.4902       377     425.51 0.0340903 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It seems that sex, absences and final grade (G3) are statistically significant explanatory variables for alcohol use. The directions of their effects are also as I hypothesized: the probability of high alcohol use increases with a higher number of absences, decreases with better grades, and is higher for males than females. My hypothesis about parents’ cohabitation was however wrong, as cohabitation is not a statistically significant explanatory variable in the model.

Assessing model performance

The variable Pstatus can be omitted from the model, as it is not statistically significant. Let’s refit the model:

fit2 <- glm(high_use ~ sex + absences + G3, family=binomial, data=dd)
data.frame(OR=round(exp(summary(fit2)$coef[,1]), 3),
           CI=round(exp(confint(fit2)), 3),
           "p-value"=round(summary(fit2)$coef[,4], 3))
##                OR CI.2.5.. CI.97.5.. p.value
## (Intercept) 0.381    0.153     0.925   0.035
## sexM        2.674    1.676     4.326   0.000
## absences    1.097    1.051     1.150   0.000
## G3          0.927    0.864     0.994   0.033
anova(fit2, test="Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: high_use
## 
## Terms added sequentially (first to last)
## 
## 
##          Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                       381     465.68              
## sex       1  14.7320       380     450.95 0.0001239 ***
## absences  1  20.8782       379     430.07 4.894e-06 ***
## G3        1   4.5585       378     425.51 0.0327561 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Looks good, but how does the model perform as a binary classifier? We can use the model to predict the probability of high alcohol consumption for the individuals in the training data, and then dichotomize these probabilities into a binary prediction of the outcome. For this, we need to set a cutoff probability for a positive outcome (= high alcohol consumption). Choosing a sensible cutoff is an art in its own right, but let’s simply use the threshold of 0.5 (i.e. probabilities greater than 0.5 are interpreted to indicate high alcohol consumption). This is probably not a good choice, and it would be a better idea to define the cutoff value with e.g. a ROC curve.

# Predicting the training data
pred <- predict(fit2, type="response")
dd$prob <- pred
pred2 <- ifelse(pred > 0.5, TRUE, FALSE)
dd$pred <- pred2
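
For completeness, a cutoff could indeed be picked from a ROC curve; a sketch assuming the pROC package is installed (not part of the original analysis):

```r
library(pROC)
# ROC curve of the predicted probabilities against the observed outcome
roc_obj <- roc(response = dd$high_use, predictor = dd$prob)
# threshold maximizing sensitivity + specificity (Youden's J)
coords(roc_obj, "best", ret = c("threshold", "sensitivity", "specificity"))
```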

Then we can produce a confusion matrix, which is a two-way crosstabulation of observed versus predicted values for the outcome variable, and visualize it.

# Confusion matrix
conf <- table(obs=dd$high_use, pred=pred2)
addmargins(conf)
##        pred
## obs     FALSE TRUE Sum
##   FALSE   256   12 268
##   TRUE     86   28 114
##   Sum     342   40 382
prop.table(conf)
##        pred
## obs          FALSE       TRUE
##   FALSE 0.67015707 0.03141361
##   TRUE  0.22513089 0.07329843
# Same as a plot
# initialize a plot of 'high_use' versus 'probability' in 'dd'
g <- ggplot(dd, aes(x = prob, y = high_use, col = pred2))
# define the geom as points and draw the plot
g + geom_point()

From the confusion matrix we can compute that the total proportion of inaccurately classified individuals (= the training error of the model) is (86 + 12) / 382 ≈ 0.26. Is this better than the simple guessing strategy that nobody is a high user (as not drinking heavily is more common than drinking heavily)?

The training error of the model:

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = dd$high_use, prob = dd$pred)
## [1] 0.2565445

The training error of the simple guessing:

# Simple guessing strategy: nobody is a high user
dd$pred3 <- F
loss_func(class = dd$high_use, prob = dd$pred3)
## [1] 0.2984293

The training error of the simple guessing strategy is only slightly larger than that of our model, so our model is not a very successful predictor of alcohol consumption.

Bonus: Cross-validation

Above, we assessed model performance by the model’s ability to correctly predict the same data on which it was trained. This is not a good idea, as a model generally fits its training data better than any unseen data. To avoid this overestimation, we can perform cross-validation to get a more realistic estimate of the model’s actual predictive power. Let’s perform 10-fold cross-validation: we divide the data into 10 complementary parts, use one part as the testing set and the other 9 parts as the training set, fit the model on the training set and then validate the results on the smaller testing set. We repeat this process so that eventually all of the data is used for both training and testing, and then calculate the mean error rate over all repetitions.

library(boot)
cv <- cv.glm(data = dd, cost = loss_func, glmfit = fit2, K = 10)
cv$delta[1]
## [1] 0.2696335

We can see that the cross-validated error rate of our model is indeed higher than its training error (0.257). The test-set performance of the model is about the same as that of the model fitted in the DataCamp exercise.


Exercise 4: Clustering and classification

Description of the data

The dataset used in this exercise contains housing data for suburban Boston. It is a built-in dataset in the R package MASS and can be loaded with the data(Boston) command. Let’s check the variables and the structure of the data:

# Load Boston data
library(dplyr)
library(MASS)
data(Boston)

# Structure and dimensions
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

There are 506 rows and 14 columns in the data. The dataset is described in more detail in the help file.

Explorative analyses

What does the data look like? Let’s produce summaries and a graphical overview of the variables:

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

All variables are numerical/integer except chas, which is a binary dummy variable indicating the location with respect to Charles River. The variables vary in scale.

# Pairplot
pairs(Boston, cex=0.1, pch=16)

Visualization of correlations between variables:

# Correlation matrix and visualization
library(tidyr)
library(corrplot)
cor_matrix<-cor(Boston) 
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex=0.6)

There seem to be a number of linear associations between the variables, as shown in the pairplot: e.g. between the number of rooms (rm) and the median value of owner-occupied homes (medv), and between the proportion of owner-occupied units built prior to 1940 (age) and the nitrogen oxides concentration (nox), to name but a few.
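
The strongest pairwise correlations can also be listed numerically (a sketch, assuming cor_matrix from above; the 0.7 cutoff is arbitrary):

```r
# Pairs of distinct variables with |r| > 0.7 (upper triangle only,
# so each pair is listed once)
strong <- which(abs(cor_matrix) > 0.7 & upper.tri(cor_matrix), arr.ind = TRUE)
data.frame(var1 = rownames(cor_matrix)[strong[, 1]],
           var2 = colnames(cor_matrix)[strong[, 2]],
           r    = round(cor_matrix[strong], 2))
```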

Data manipulation

Let’s standardize the dataset for further analyses: we subtract the column means from the corresponding columns and divide the differences by the standard deviations. After standardization, all the (scaled) variables are centered around zero, and their unit is one standard deviation.

# Scaling
bs <- as.data.frame(scale(Boston))
summary(bs)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

Let’s transform the crime rate (crim) into a categorical variable. We want to cut the variable by quantiles to get the high, low and middle rates of crime into their own categories. This is achieved by using the quantiles as break points.

# Transform 'crim' to a categorical variable
bs$crim <- cut(bs$crim, breaks = quantile(bs$crim), include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

# Look at the table of the new factor crim
table(bs$crim)
## 
##      low  med_low med_high     high 
##      127      126      126      127

Then we need to divide the dataset into train and test sets for further analyses. Splitting the original data into test and train sets allows us to check how well our models work. We make the division here in such a way that 80 % of the data belongs to the train set (and 20 % to the test set).

ind <- sample(nrow(bs), nrow(bs)*0.8)
train <- bs[ind,]
test <- bs[-ind,]

Linear discriminant analysis (LDA)

Linear discriminant analysis (LDA) is a classification method that can be used to

  • Find the variables that best separate the classes
  • Predict the classes of new data
  • Reduce the dimensionality of the data.

The target variable of LDA needs to be categorical.

Let’s try LDA on the newly created categorical crime rate variable (crim) to see if we can find combinations of variables that separate different crime rate classes from each other. This means that we use all the other variables in the dataset as predictor variables. The LDA is fitted on the train dataset.

# Linear discriminant analysis
set.seed(123)
lda.fit <- lda(formula=as.numeric(crim) ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(as.numeric(crim) ~ ., data = train)
## 
## Prior probabilities of groups:
##         1         2         3         4 
## 0.2673267 0.2301980 0.2574257 0.2450495 
## 
## Group means:
##            zn      indus        chas        nox          rm        age
## 1  0.99063148 -0.9050961 -0.16296517 -0.8639671  0.40854511 -0.8952370
## 2 -0.09028089 -0.3357776 -0.01832260 -0.5961645 -0.15291520 -0.4018061
## 3 -0.39571399  0.2384871  0.18195173  0.4317147 -0.01722921  0.4291779
## 4 -0.48724019  1.0171737 -0.07348562  1.0735101 -0.37869675  0.7956921
##          dis        rad        tax     ptratio      black       lstat
## 1  0.8712878 -0.6979446 -0.7316013 -0.41399394  0.3781200 -0.76870358
## 2  0.4208987 -0.5373033 -0.4699552 -0.09071536  0.3113432 -0.10393455
## 3 -0.3980661 -0.3932819 -0.2763409 -0.26815160  0.1205193  0.08029325
## 4 -0.8393230  1.6375616  1.5136504  0.78011702 -0.5481859  0.82818338
##          medv
## 1  0.46618631
## 2 -0.02905425
## 3  0.07473939
## 4 -0.69328787
## 
## Coefficients of linear discriminants:
##                 LD1           LD2         LD3
## zn       0.10227956  0.7233755877 -1.03591471
## indus    0.01414096 -0.3324760590  0.14998040
## chas    -0.07288697 -0.0758892758  0.13292261
## nox      0.32581245 -0.5425731577 -1.30124379
## rm      -0.09702408 -0.0344156820 -0.12054485
## age      0.30511189 -0.3060396099 -0.20909820
## dis     -0.09572531 -0.2357982958  0.32970448
## rad      2.98430397  0.9236590758 -0.08131541
## tax      0.03608205 -0.0325967724  0.74589200
## ptratio  0.07837874  0.0779243451 -0.35891926
## black   -0.09826692 -0.0006390227  0.10628839
## lstat    0.20805989 -0.2920412751  0.47698883
## medv     0.14196108 -0.4475367494 -0.13704067
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9394 0.0442 0.0164

We can see from the result that three linear discriminant functions were formed (number of crime categories minus one, which is the maximum possible number). These discriminant functions are linear combinations of the original variables, and the coefficients for each function and variable pair can be seen in the table. From the proportion of trace we can see that the first discriminant function LD1 explains about 94 % of the between-group variance, while LD2 and LD3 explain 4.4 % and 1.6 %, respectively.
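
The proportion of trace can be computed directly from the singular values stored in the fitted lda object. A minimal sketch using the built-in iris data (instead of the train data above) so that it runs on its own:

```r
library(MASS)

# Fit LDA on a small built-in dataset for illustration
fit <- lda(Species ~ ., data = iris)

# The squared singular values give the proportion of between-group
# variance explained by each discriminant function
prop_trace <- fit$svd^2 / sum(fit$svd^2)
round(prop_trace, 4)
```

This is the same computation that print.lda performs when it reports the "Proportion of trace" line.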

We can visualize the results of the LDA with a biplot. The original variables and their effects can be visualized as arrows on top of the biplot. The first two LD functions are shown in the plot.

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[,choices[1]],
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads),
       cex = tex, col=color, pos=3)
}

# plot the lda results
classes <- as.numeric(train$crim)
plot(lda.fit, dimen = 2, col = classes)
lda.arrows(lda.fit, myscale = 2, col = "#666666")

The biplot reveals that the variable rad separates the high crime rate class from the other classes almost perfectly, as only one observation mixes with the lower classes. The three lower crime rate classes don’t separate properly.

We can see how well the LDA model performs as a whole by using it to predict the classes of the test set. Then we can calculate the total prediction error by cross-tabulating the results against the original data, similarly to what was done when predicting with the logistic regression model. (Note to self: there’s actually no need to remove the categorical crime rate variable from the test data, but I did as was instructed.)

# save and remove categorical crime data from test data
crime <- test$crim
test <- dplyr::select(test, -crim)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = as.numeric(crime), predicted = lda.pred$class)
##        predicted
## correct  1  2  3  4
##       1 12  7  0  0
##       2  5 17 11  0
##       3  0  7 15  0
##       4  0  0  0 28

We can calculate from the confusion matrix that the total error rate is about 29 % in the test data. [The confusion matrix changes on every knitting because of the randomization, so the error rate given here probably doesn’t match the matrix shown above!] In other words, the model correctly classifies about 71 % of new cases, which is not bad compared to the 25 % success rate expected from randomly guessing one of four classes. The best classification performance was achieved in the high crime rate class.
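
The error rate calculation can be sketched from the confusion matrix; here the matrix shown above is hard-coded, so the numbers come from one particular knitting:

```r
# Confusion matrix from one knitting (rows = correct, cols = predicted)
tab <- matrix(c(12,  7,  0,  0,
                 5, 17, 11,  0,
                 0,  7, 15,  0,
                 0,  0,  0, 28),
              nrow = 4, byrow = TRUE)

# Correct predictions are on the diagonal
accuracy   <- sum(diag(tab)) / sum(tab)
error_rate <- 1 - accuracy
round(error_rate, 3)  # about 0.294
```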

Clustering with K-means

We used LDA to separate and predict known classes of individual observations. But what if we don’t know the classes in advance, and instead want to detect similar observations and assign them to groups or clusters based on their similarity? We can use clustering methods, e.g. the K-means algorithm, to achieve this.

First, we need to calculate the (dis)similarity or distance of individual observations. There are many different distance measures, but let’s use the Euclidean distance, which is the “normal” or most common distance measure.
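
As a sanity check, dist() agrees with the textbook formula sqrt(sum((x - y)^2)); a small sketch on made-up numbers:

```r
# Two toy observations with three variables each
x <- c(1, 2, 3)
y <- c(4, 6, 3)

# Euclidean distance by hand
d_manual <- sqrt(sum((x - y)^2))

# The same distance from dist() on a two-row matrix
d_dist <- as.numeric(dist(rbind(x, y)))

c(manual = d_manual, dist = d_dist)  # both equal 5
```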

# load MASS and the Boston data
library(MASS)
data('Boston')
bs2 <- as.data.frame(scale(Boston))

# euclidean distance matrix
dist_eu <- dist(bs2)

K-means is an unsupervised clustering method that iteratively updates the cluster memberships of individual observations so that eventually similar observations belong to the same clusters. We need to decide the number of clusters before running the K-means algorithm. Let’s start with 5 clusters:

km <- kmeans(dist_eu, centers = 5)

But wait! How do we know the optimal number of clusters? One way to determine it is to look at how the total within-cluster sum of squares (WCSS) behaves when the number of clusters changes.

# maximum number of clusters to try
k_max <- 10

# calculate the total within sum of squares
set.seed(123)
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})

# visualize the results
plot(1:k_max, twcss, type='b')

The optimal number of clusters is where the total WCSS drops sharply, so let’s rerun K-means with 2 clusters. We can visualize the results with a pairplot, where the clusters are separated by colors.

# The optimal number of clusters is where the total within-cluster sum of squares drops sharply
# -> 2 clusters
km <- kmeans(dist_eu, centers = 2)

# plot the scaled Boston dataset with the clusters
pairs(bs2, col = km$cluster, cex = 0.1, pch = 16)

We can see from the pairplot that rad and tax are the two most informative variables to separate the two clusters from each other. Also nox seems to contain useful information for clustering.

Bonus

K-means clusters as targets of LDA

# k-means again
set.seed(123)
km2 <- kmeans(dist_eu, centers = 5)

# LDA using the k-means clusters as target classes
bs2$cl <- km2$cluster
lda.fit2 <- lda(cl ~ ., data = bs2)
plot(lda.fit2, col=as.numeric(bs2$cl), dimen=2)
lda.arrows(lda.fit2, myscale = 3, col = "#666666")

Super-Bonus

3D visualizations with plotly

Color is defined by the categorical crime rate:

model_predictors <- dplyr::select(train, -crim)
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=train$crim)

Color is defined by the clusters of the k-means:

train$cl <- bs2$cl[match(rownames(train), rownames(bs2))]
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=as.factor(train$cl))

Exercise 5: Dimensionality reduction techniques

Description of the data

The data used in this exercise is taken from the Human Development Report 2015 (see more info here).

Two different datasets, one on human development and another on gender inequality, were combined in the data wrangling exercise to include the following variables:

  • edu2FMrat: ratio of proportions of females/males with at least secondary education
  • labFMrat: ratio of proportions of females/males in labour force
  • expedu: expected years of schooling
  • lifeexp: life expectancy at birth
  • GNI: gross national income per capita
  • matmor: maternal mortality ratio
  • adbi: adolescent birth rate
  • repparl: percentage of female representatives in parliament

Observations (countries) with missing values were removed, and the manipulated data consist of 155 countries.

Let’s load the data.

# Load human data
library(dplyr)
library(corrplot)
library(tidyr)
load("./data/human.RData")

Exploratory analyses

What does the data look like? Let’s produce summaries and a graphical overview of the variables:

# Structure and dimensions
str(human4)
## 'data.frame':    155 obs. of  8 variables:
##  $ edu2FMrat: num  1.007 0.997 0.983 0.989 0.969 ...
##  $ labFMrat : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ lifeexp  : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ expedu   : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ GNI      : num  64992 42261 56431 44025 45435 ...
##  $ matmor   : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ adbi     : num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ repparl  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
# Variable summaries and graphical overview
summary(human4)
##    edu2FMrat         labFMrat         lifeexp          expedu     
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##       GNI             matmor            adbi           repparl     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
library(GGally)
library(ggplot2)
p <- ggpairs(human4, mapping = aes(alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p

All the variables are numerical but have varying scales, e.g. edu2FMrat and labFMrat are ratios, repparl is a percentage, and GNI is a monetary value. There are several strong correlations between the variables, e.g. a positive correlation of 0.76 between adbi (adolescent birth rate) and matmor (maternal mortality ratio), and a negative correlation of -0.86 between matmor and lifeexp (life expectancy at birth).

Principal Component Analysis (PCA)

Principal component analysis (PCA) is a dimension reduction technique that transforms the data into principal components, which are linear combinations of the original variables. Let’s first try PCA on unstandardized data.

# PCA on unstandardized data
pca_human <- prcomp(human4)
pca_human
## Standard deviations:
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
## 
## Rotation:
##                     PC1           PC2           PC3           PC4
## edu2FMrat -5.607472e-06  0.0006713951 -3.412027e-05 -2.736326e-04
## labFMrat   2.331945e-07 -0.0002819357  5.302884e-04 -4.692578e-03
## lifeexp   -2.815823e-04  0.0283150248  1.294971e-02 -6.752684e-02
## expedu    -9.562910e-05  0.0075529759  1.427664e-02 -3.313505e-02
## GNI       -9.999832e-01 -0.0057723054 -5.156742e-04  4.932889e-05
## matmor     5.655734e-03 -0.9916320120  1.260302e-01 -6.100534e-03
## adbi       1.233961e-03 -0.1255502723 -9.918113e-01  5.301595e-03
## repparl   -5.526460e-05  0.0032317269 -7.398331e-03 -9.971232e-01
##                     PC5           PC6           PC7           PC8
## edu2FMrat -0.0022935252  2.180183e-02  6.998623e-01  7.139410e-01
## labFMrat   0.0022190154  3.264423e-02  7.132267e-01 -7.001533e-01
## lifeexp    0.9865644425 -1.453515e-01  5.380452e-03  2.281723e-03
## expedu     0.1431180282  9.882477e-01 -3.826887e-02  7.776451e-03
## GNI       -0.0001135863 -2.711698e-05 -8.075191e-07 -1.176762e-06
## matmor     0.0266373214  1.695203e-03  1.355518e-04  8.371934e-04
## adbi       0.0188618600  1.273198e-02 -8.641234e-05 -1.707885e-04
## repparl   -0.0716401914 -2.309896e-02 -2.642548e-03  2.680113e-03
summary(pca_human)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000
##                           PC8
## Standard deviation     0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000
biplot(pca_human, choices = 1:2, cex = c(0.6, 1), col = c("grey40", "deeppink2"))

This doesn’t look very good, as there seems to be only one important variable. The reason is that PCA treats a large variance as a sign of an important variable. The variable GNI (gross national income per capita) has a much larger variance than the other variables and thus gets far too much weight in the PCA. We need to standardize the data to remove the effect of the differing variances, and redo the PCA.
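
The effect of unequal variances can be demonstrated on synthetic data (the variable names big and small are made up for this illustration): the high-variance variable captures PC1 almost entirely until the data are scaled.

```r
set.seed(1)
# Synthetic data: 'big' has a standard deviation 100 times that of 'small'
toy <- data.frame(big = rnorm(100, sd = 100), small = rnorm(100, sd = 1))

# Unscaled PCA: PC1 is almost entirely 'big'
p1 <- prcomp(toy)
abs(p1$rotation["big", "PC1"])  # close to 1

# Scaled PCA: the loadings are balanced
p2 <- prcomp(toy, scale. = TRUE)
abs(p2$rotation[, "PC1"])       # roughly equal in magnitude
```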

# Standardize and repeat
human_std <- scale(human4)
summary(human_std)
##    edu2FMrat          labFMrat          lifeexp            expedu       
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7188   Min.   :-2.7378  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6425   1st Qu.:-0.6782  
##  Median : 0.3503   Median : 0.2316   Median : 0.3056   Median : 0.1140  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.6717   3rd Qu.: 0.7126  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 1.4218   Max.   : 2.4730  
##       GNI              matmor             adbi            repparl       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
pca_human_std <- prcomp(human_std)
pca_human_std
## Standard deviations:
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
## 
## Rotation:
##                   PC1         PC2         PC3         PC4        PC5
## edu2FMrat -0.35664370  0.03796058 -0.24223089  0.62678110 -0.5983585
## labFMrat   0.05457785  0.72432726 -0.58428770  0.06199424  0.2625067
## lifeexp   -0.44372240 -0.02530473  0.10991305 -0.05834819  0.1628935
## expedu    -0.42766720  0.13940571 -0.07340270 -0.07020294  0.1659678
## GNI       -0.35048295  0.05060876 -0.20168779 -0.72727675 -0.4950306
## matmor     0.43697098  0.14508727 -0.12522539 -0.25170614 -0.1800657
## adbi       0.41126010  0.07708468  0.01968243  0.04986763 -0.4672068
## repparl   -0.08438558  0.65136866  0.72506309  0.01396293 -0.1523699
##                   PC6         PC7         PC8
## edu2FMrat  0.17713316  0.05773644  0.16459453
## labFMrat  -0.03500707 -0.22729927 -0.07304568
## lifeexp   -0.42242796 -0.43406432  0.62737008
## expedu    -0.38606919  0.77962966 -0.05415984
## GNI        0.11120305 -0.13711838 -0.16961173
## matmor     0.17370039  0.35380306  0.72193946
## adbi      -0.76056557 -0.06897064 -0.14335186
## repparl    0.13749772  0.00568387 -0.02306476
summary(pca_human_std)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
##                            PC7     PC8
## Standard deviation     0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion  0.98702 1.00000
biplot(pca_human_std, choices = 1:2, cex = c(0.6, 1), col = c("grey40", "darkviolet"))

The standardization helped, and the results now look more reasonable. The first principal component explains 53 % of the variation, and the first three principal components cumulatively explain almost 80 % of the variation in the data. So we have effectively reduced the number of variables needed to characterize the dataset. Moreover, the principal components are always orthogonal (i.e. they don’t correlate with each other), which is handy if we want to perform further analyses, e.g. linear regression, on them.
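
The orthogonality is easy to verify: the principal component scores are mutually uncorrelated. A minimal check on the built-in mtcars data (used here instead of the human data so the snippet is self-contained):

```r
# PCA on standardized built-in data
pca <- prcomp(scale(mtcars))

# The largest off-diagonal correlation between component scores
# is numerically zero
cc <- cor(pca$x)
max(abs(cc[upper.tri(cc)]))
```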

The first principal component (PC1) is a linear combination of all variables other than female representation in the labour force and in parliament (labFMrat and repparl, respectively), as these two are the only variables with low loadings on PC1 (i.e. they don’t correlate with PC1). On the other hand, these two variables have high loadings on PC2, whereas the other variables barely correlate with PC2. Female representation both in the labour force and in parliament correlates positively with PC2, so we can interpret PC2 as representing gender equality (the most gender-equal countries are located at the top of the biplot). The female/male ratio in secondary education (edu2FMrat), life expectancy (lifeexp), expected years of education (expedu) and gross national income per capita (GNI) have negative loadings on PC1, whereas maternal mortality ratio (matmor) and adolescent birth rate (adbi) have positive loadings on it. This implies that a high value of PC1 (location on the right of the biplot) signifies a low development level of a country, combining social, educational and economic aspects of development.

Multiple Correspondence Analysis (MCA)

Next, we try another dimension reduction method on the tea data from the package FactoMineR. This dataset includes 300 observations of people’s tea drinking habits.

# Load tea dataset
library(FactoMineR)
data(tea)
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...

The dataset consists almost exclusively of factor variables, so we can’t perform PCA, which requires numerical variables. As the data contain a very large number of variables, let’s choose only a subset of them for further analysis.

library(tidyr)
tea2 <- tea[c("breakfast","tea.time","friends","Tea","How","sugar","sex","sophisticated")]
str(tea2)
## 'data.frame':    300 obs. of  8 variables:
##  $ breakfast    : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time     : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ friends      : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ Tea          : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How          : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar        : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ sex          : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ sophisticated: Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
gather(tea2) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

Let’s perform a multiple correspondence analysis (MCA) on these 8 variables. MCA is a generalization of PCA that can be used to analyze and visualize categorical data in a Euclidean space.

# MCA
res.mca <- MCA(tea2, graph=FALSE)
summary(res.mca, nbelements=Inf, nbind=5)
## 
## Call:
## MCA(X = tea2, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.194   0.175   0.170   0.136   0.120   0.115
## % of var.             14.076  12.747  12.366   9.927   8.740   8.337
## Cumulative % of var.  14.076  26.823  39.189  49.116  57.855  66.192
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.108   0.104   0.095   0.089   0.069
## % of var.              7.887   7.533   6.905   6.446   5.037
## Cumulative % of var.  74.079  81.612  88.517  94.963 100.000
## 
## Individuals (the 5 first)
##                      Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                 |  0.796  1.092  0.393 |  0.611  0.709  0.231 |  0.046
## 2                 |  0.259  0.115  0.035 |  1.012  1.947  0.538 |  0.238
## 3                 | -0.479  0.395  0.246 | -0.172  0.056  0.032 | -0.074
## 4                 |  0.687  0.812  0.465 | -0.406  0.313  0.162 |  0.132
## 5                 |  0.563  0.546  0.247 |  0.338  0.218  0.089 |  0.111
##                      ctr   cos2  
## 1                  0.004  0.001 |
## 2                  0.111  0.030 |
## 3                  0.011  0.006 |
## 4                  0.034  0.017 |
## 5                  0.024  0.010 |
## 
## Categories
##                       Dim.1     ctr    cos2  v.test     Dim.2     ctr
## breakfast         |  -0.023   0.016   0.000  -0.380 |   0.656  14.735
## Not.breakfast     |   0.021   0.015   0.000   0.380 |  -0.606  13.601
## Not.tea time      |   0.793  17.748   0.488  12.077 |  -0.226   1.587
## tea time          |  -0.615  13.757   0.488 -12.077 |   0.175   1.230
## friends           |  -0.302   3.855   0.172  -7.175 |  -0.181   1.522
## Not.friends       |   0.570   7.265   0.172   7.175 |   0.341   2.868
## black             |  -0.151   0.361   0.007  -1.490 |   1.070  20.158
## Earl Grey         |  -0.079   0.259   0.011  -1.835 |  -0.308   4.341
## green             |   0.800   4.546   0.079   4.863 |  -0.601   2.838
## alone             |  -0.057   0.136   0.006  -1.339 |  -0.298   4.105
## lemon             |  -0.104   0.077   0.001  -0.632 |  -0.401   1.262
## milk              |   0.375   1.906   0.037   3.342 |   0.907  12.326
## other             |  -1.011   1.981   0.032  -3.075 |   1.567   5.256
## No.sugar          |  -0.431   6.199   0.199  -7.706 |   0.225   1.866
## sugar             |   0.461   6.627   0.199   7.706 |  -0.241   1.995
## F                 |  -0.582  12.993   0.495 -12.162 |  -0.133   0.749
## M                 |   0.850  18.957   0.495  12.162 |   0.194   1.093
## Not.sophisticated |   0.360   2.367   0.051   3.910 |   0.548   6.069
## sophisticated     |  -0.142   0.936   0.051  -3.910 |  -0.217   2.399
##                      cos2  v.test     Dim.3     ctr    cos2  v.test  
## breakfast           0.397  10.899 |  -0.236   1.965   0.051  -3.920 |
## Not.breakfast       0.397 -10.899 |   0.218   1.814   0.051   3.920 |
## Not.tea time        0.039  -3.436 |   0.192   1.181   0.029   2.920 |
## tea time            0.039   3.436 |  -0.149   0.915   0.029  -2.920 |
## friends             0.062  -4.290 |  -0.317   4.828   0.189  -7.526 |
## Not.friends         0.062   4.290 |   0.598   9.099   0.189   7.526 |
## black               0.375  10.592 |   0.491   4.378   0.079   4.862 |
## Earl Grey           0.171  -7.143 |  -0.415   8.162   0.311  -9.647 |
## green               0.045  -3.656 |   1.328  14.256   0.218   8.072 |
## alone               0.164  -7.012 |   0.326   5.083   0.198   7.685 |
## lemon               0.020  -2.438 |  -1.340  14.510   0.222  -8.143 |
## milk                0.219   8.088 |  -0.351   1.899   0.033  -3.127 |
## other               0.076   4.766 |   0.300   0.199   0.003   0.913 |
## No.sugar            0.054   4.023 |   0.541  11.122   0.313   9.674 |
## sugar               0.054  -4.023 |  -0.578  11.889   0.313  -9.674 |
## F                   0.026  -2.780 |   0.077   0.260   0.009   1.612 |
## M                   0.026   2.780 |  -0.113   0.379   0.009  -1.612 |
## Not.sophisticated   0.119   5.958 |  -0.527   5.779   0.110  -5.727 |
## sophisticated       0.119  -5.958 |   0.208   2.285   0.110   5.727 |
## 
## Categorical variables (eta2)
##                     Dim.1 Dim.2 Dim.3  
## breakfast         | 0.000 0.397 0.051 |
## tea.time          | 0.488 0.039 0.029 |
## friends           | 0.172 0.062 0.189 |
## Tea               | 0.080 0.383 0.364 |
## How               | 0.063 0.322 0.295 |
## sugar             | 0.199 0.054 0.313 |
## sex               | 0.495 0.026 0.009 |
## sophisticated     | 0.051 0.119 0.110 |

The first two dimensions retain 27 % of the variance in the data. The v.test values show that most categories have significant (|v.test| > 1.96) coordinates in the first three dimensions. The variables sex and tea.time correlate most strongly with dimension 1, and breakfast, Tea and How with dimension 2.
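
The 1.96 cutoff comes from the standard normal distribution: under the null hypothesis the v.test statistic is approximately N(0, 1), so |v.test| > 1.96 corresponds to a two-sided p-value below 0.05. A quick check in base R:

```r
# Two-sided p-value implied by a v.test value at the threshold
v <- 1.96
p <- 2 * pnorm(-abs(v))
round(p, 3)  # 0.05
```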

We can visualize the results with biplots using different options.

# MCA
plot(res.mca, invisible = c("var"))

plot(res.mca, invisible = c("ind"), habillage="quali")

The second biplot shows the variable categories plotted in the (Euclidean) MCA space formed by the first two MCA dimensions. Categories that are mutually similar are positioned close to each other in the biplot. One clear outlier category, those who drink their tea in “other” ways, stands out at the top of the plot. However, there are very few cases in this category, as was seen in the barplot above.

In the second biplot, dimension 1 separates females and males into different groups, as the category summary above implied. Females seem to prefer drinking their tea at tea time, with friends and without sugar, whereas men are not as particular about drinking tea at tea time or with friends, and use sugar more often. Dimension 2 separates those who drink tea at breakfast from those who don’t. The former group prefers black tea with milk, whereas the latter chooses green tea or Earl Grey (weird! isn’t Earl Grey a black tea variant?) drunk alone or with lemon.